Purpose: compare the most recent gimap results to those of the original Parrish et al paper. We don’t need to replicate the data exactly (and realistically can’t), but we do want to ensure that there aren’t any particular biases or trends showing up that are not related to biology.
Conclusion TL;DR: The data look pretty similar now. There are certainly some outliers that differ, but there are no particular trends that I can see as far as samples, probes, or data steps. Basically, the differences that exist seem mostly random.
We’re going to load the brand new version of gimap that is on this branch.
devtools::load_all("..")
Calculate results with the new version, which combines the analysis into one linear model and uses averages across replicates for expected values.
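The chunk that produced the messages below is hidden in this rendering; a minimal sketch of what it presumably looks like, with function names taken from the gimap package vignettes (treat the exact calls and arguments as assumptions, not the actual hidden code):

```r
# Hypothetical sketch of the hidden analysis chunk; function names are
# from the gimap vignettes and the exact arguments may differ.
gimap_dataset <- gimap_dataset %>%
  gimap_annotate(cell_line = "HELA") %>% # prints "Annotating Data"
  gimap_normalize() %>%                  # prints "Normalizing Log Fold Change"
  calc_gi()                              # prints "Calculating Genetic Interaction scores"
```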
Annotating Data
Downloading: 3.4 kB
Downloading: Achilles_common_essentials.csv
Normalizing Log Fold Change
Calculating Genetic Interaction scores
Downloaded the paper results from the online supplementary information.
Load in crispr scores from the paper. Make the names so they match and reshape the data.
paper_cs <- readxl::read_xlsx("parrish_crispr_scores.xlsx", skip = 1) %>%
  dplyr::select(pg_ids = pgRNA,
                Day22_RepA_late = RepA_CRISPR_score,
                Day22_RepB_late = RepB_CRISPR_score,
                Day22_RepC_late = RepC_CRISPR_score) %>%
  tidyr::pivot_longer(-pg_ids,
                      names_to = "rep",
                      values_to = "crispr_score")
Load in gi scores from the paper.
paper_gi <- readxl::read_xlsx("parrish_gi_scores.xlsx", skip = 1) %>%
# A little cleaning.
dplyr::mutate(
pgRNA_target = gsub("\\|", "_", paralog_pair))
both_df <- dplyr::left_join(gimap_dataset$gi_scores, paper_gi, by = "pgRNA_target")
Are the expected values similar? Yes
ggplot(both_df, aes(x = HeLa_DKO_expected_CS, mean_expected_cs)) +
geom_point() +
theme_bw() +
ylab("gimap mean expected CRISPR values") +
xlab("Parrish et al mean expected CRISPR values")
Warning: Removed 2060 rows containing missing values or values outside the scale range
(`geom_point()`).
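To put a number on "similar", a quick correlation sketch (the same pattern works for the observed and GI score comparisons below); `use = "complete.obs"` drops the rows with missing values that the warning above mentions:

```r
# Pearson correlation between the paper's and gimap's mean expected
# CRISPR scores, ignoring rows missing on either side.
cor(both_df$HeLa_DKO_expected_CS, both_df$mean_expected_cs,
    use = "complete.obs")
```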
How do the observed values line up? Mostly good but a couple wacky ones.
ggplot(both_df, aes(x = HeLa_DKO_observed_CS, mean_observed_cs)) +
geom_point() +
theme_bw() +
ylab("gimap mean observed CRISPR values") +
xlab("Parrish et al mean observed CRISPR values")
Warning: Removed 2060 rows containing missing values or values outside the scale range
(`geom_point()`).
What’s that point that’s super weird in observed?
It’s NBPF3_NBPF15; what’s its problem?
both_df %>%
dplyr::mutate(diff = abs(HeLa_DKO_observed_CS - mean_observed_cs)) %>%
dplyr::select(pgRNA_target, diff, HeLa_DKO_observed_CS, mean_observed_cs) %>%
dplyr::arrange(desc(diff))
How do the GI scores line up? Honestly not bad.
ggplot(both_df, aes(x = HeLa_GI_score, gi_score)) +
geom_point() +
theme_bw() +
ylab("gimap GI score") +
xlab("Parrish et al GI score")
Warning: Removed 2060 rows containing missing values or values outside the scale range
(`geom_point()`).
Which points have the biggest differentials?
both_df %>%
dplyr::mutate(diff = abs(HeLa_GI_score - gi_score)) %>%
dplyr::select(pgRNA_target, diff, HeLa_GI_score, gi_score) %>%
dplyr::arrange(desc(diff))
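To get a rough count of how many targets differ substantially (the cutoff of 1 GI-score unit is arbitrary, just for a sense of scale):

```r
# Count targets whose GI scores differ by more than an arbitrary cutoff.
both_df %>%
  dplyr::mutate(diff = abs(HeLa_GI_score - gi_score)) %>%
  dplyr::filter(diff > 1) %>%
  nrow()
```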
How do the FDR scores line up? Not too bad, but let’s see what’s wacky.
ggplot(both_df, aes(x = -log10(HeLa_GI_fdr), y = -log10(fdr))) +
  geom_point() +
  theme_bw() +
  ylab("gimap FDR (-log10)") +
  xlab("Parrish et al FDR (-log10)")
Warning: Removed 2060 rows containing missing values or values outside the scale range
(`geom_point()`).
Which targets have the biggest differentials?
both_df %>%
dplyr::mutate(diff = abs(HeLa_GI_fdr - fdr)) %>%
dplyr::select(pgRNA_target, diff, HeLa_GI_fdr, fdr) %>%
dplyr::arrange(desc(diff))
Hypothesis: if we calculate something a bit wonky in the early stages of the data handling, does that affect how much we differ from the original at the later stages?
Let’s reformat this data a bit to take a look at absolute differences of the calculations at the various steps.
data_step <- both_df %>%
dplyr::mutate(
observed_diff = abs(HeLa_DKO_observed_CS - mean_observed_cs),
expected_diff = abs(HeLa_DKO_expected_CS - mean_expected_cs),
gi_diff = abs(HeLa_GI_score - gi_score),
fdr_diff = abs(-log10(HeLa_GI_fdr) - -log10(fdr))) %>%
dplyr::select(pgRNA_target, observed_diff, expected_diff, gi_diff, fdr_diff) %>%
tidyr::pivot_longer(-pgRNA_target,
names_to = "stat",
values_to = "diff"
) %>%
dplyr::mutate(stat = factor(stat,
levels = c("observed_diff",
"expected_diff",
"gi_diff",
"fdr_diff")))
Most of the stats have a small difference, but a small minority have a big difference between what was found in the paper and what the new version of the software finds.
Note that the FDR differences here are on a -log10 scale, unlike the other stats (on the raw scale the plot looked wonky because the other stats aren’t in that range), so the cross-stat comparison should be taken with a grain of salt.
ggplot(data_step,
aes(x = stat, y = diff, group = stat)) +
geom_violin() +
theme_bw()
Warning: Removed 8240 rows containing non-finite outside the scale range (`stat_ydensity()`).
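To quantify "most stats have a small difference", a quick per-stat summary; filtering to finite values drops the rows the warning above refers to:

```r
# Median and maximum absolute difference per statistic, keeping only
# finite values (e.g. -log10 of an FDR of 0 is Inf).
data_step %>%
  dplyr::filter(is.finite(diff)) %>%
  dplyr::group_by(stat) %>%
  dplyr::summarize(median_diff = median(diff),
                   max_diff = max(diff))
```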
Here I’m just plotting some of the targets that had the biggest stat differentials in one or another of the calculations and seeing what happens to that differential at later steps.
data_step %>%
dplyr::filter(pgRNA_target %in%
c("PPP1R14C_PPP1R14B", "NKAPL_NKAP", "DDX3Y_DDX3X",
"DHFR2_DHFR", "ASF1A_ASF1B", "NBPF3_NBPF15",
"EIF4E1B_EIF4E", "SEC61A2_SEC61A1", "COPG2_COPG1")) %>%
ggplot(aes(x = stat, y = diff, group = pgRNA_target, color = pgRNA_target)) +
geom_point() +
geom_line() +
theme_bw()
I’m not seeing much of a pattern. It seems very random, which I think is good?
Let’s match up the paper data and the new gimap data together.
both_cs_df <- gimap_dataset$normalized_log_fc %>%
dplyr::select(pgRNA_target, pg_ids, rep, crispr_score) %>%
dplyr::left_join(paper_cs, by = c("pg_ids", "rep"),
suffix = c("_gimap", "_paper"))
Plot the probe-level pg_ids. There are some larger differences between the paper and the gimap software for the very negative CRISPR scores, but not too bad.
both_cs_df %>%
ggplot(aes(crispr_score_gimap, crispr_score_paper)) +
geom_point() +
facet_wrap(~rep)
Let’s look at the construct variability for probes for each target.
inter_target <- both_cs_df %>%
dplyr::group_by(pgRNA_target, rep) %>%
dplyr::summarize(paper_sd = sd(crispr_score_paper, na.rm = TRUE),
gimap_sd = sd(crispr_score_gimap, na.rm = TRUE),
paper_mean = mean(crispr_score_paper, na.rm = TRUE),
gimap_mean = mean(crispr_score_gimap, na.rm = TRUE),
.groups = "drop") %>%
dplyr::mutate(sd_diff = abs(paper_sd - gimap_sd))
What does within target variability look like between the two data sources?
inter_target %>%
ggplot(aes(x = paper_sd, y = gimap_sd, color = rep)) +
geom_point() +
theme_bw()
Warning: Removed 3 rows containing missing values or values outside the scale range (`geom_point()`).
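A quick numeric check on how far apart the two within-target SDs are; if the sources agree, most of these differences should sit near zero:

```r
# Distribution of the absolute difference between the paper's and
# gimap's within-target standard deviations.
summary(inter_target$sd_diff)
```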
Does this change across samples?
What does the within target standard deviation look like for the software?
ggplot(inter_target, aes(x = rep, y = gimap_sd, color = rep)) +
geom_violin() +
theme_bw()
Warning: Removed 3 rows containing non-finite outside the scale range (`stat_ydensity()`).
How about the paper? Pretty similar.
ggplot(inter_target, aes(x = rep, y = paper_sd, color = rep)) +
geom_violin() +
theme_bw()
Warning: Removed 3 rows containing non-finite outside the scale range (`stat_ydensity()`).
So you know what software was used for these calculations, take a look at the session info here:
sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: x86_64-apple-darwin20
Running under: macOS 15.3.2
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] gimap_1.0.3 testthat_3.2.3 magrittr_2.0.3 ggplot2_3.5.1
loaded via a namespace (and not attached):
[1] tidyselect_1.2.1 farver_2.1.2 dplyr_1.1.4 fastmap_1.2.0
[5] janitor_2.2.1 promises_1.3.2 digest_0.6.37 timechange_0.3.0
[9] mime_0.12 lifecycle_1.0.4 ellipsis_0.3.2 compiler_4.4.0
[13] sass_0.4.9 rlang_1.1.5 tools_4.4.0 yaml_2.3.10
[17] knitr_1.49 labeling_0.4.3 askpass_1.2.1 htmlwidgets_1.6.4
[21] bit_4.5.0.1 pkgbuild_1.4.6 curl_6.2.0 RColorBrewer_1.1-3
[25] pkgload_1.4.0 miniUI_0.1.1.1 withr_3.0.2 purrr_1.0.4
[29] desc_1.4.3 grid_4.4.0 urlchecker_1.0.1 profvis_0.4.0
[33] xtable_1.8-4 colorspace_2.1-1 scales_1.3.0 cli_3.6.3
[37] rmarkdown_2.29 crayon_1.5.3 generics_0.1.3 remotes_2.5.0.9000
[41] rstudioapi_0.17.1 httr_1.4.7 tzdb_0.4.0 sessioninfo_1.2.2
[45] readxl_1.4.3 cachem_1.1.0 stringr_1.5.1 parallel_4.4.0
[49] cellranger_1.1.0 vctrs_0.6.5 devtools_2.4.5 jsonlite_1.8.9
[53] hms_1.1.3 bit64_4.6.0-1 tidyr_1.3.1 jquerylib_0.1.4
[57] glue_1.8.0 lubridate_1.9.4 stringi_1.8.4 gtable_0.3.6
[61] later_1.4.1 munsell_0.5.1 tibble_3.2.1 pillar_1.10.1
[65] htmltools_0.5.8.1 brio_1.1.5 openssl_2.3.2 R6_2.5.1
[69] rprojroot_2.0.4 vroom_1.6.5 evaluate_1.0.3 shiny_1.10.0
[73] readr_2.1.5 backports_1.5.0 pheatmap_1.0.12 memoise_2.0.1
[77] broom_1.0.7 snakecase_0.11.1 bslib_0.8.0.9000 httpuv_1.6.15
[81] Rcpp_1.0.14 xfun_0.50 fs_1.6.5 usethis_3.1.0
[85] pkgconfig_2.0.3